
[SPARK-34346][CORE][SQL][3.0] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression#31492

Closed
yaooqinn wants to merge 1 commit into apache:branch-3.0 from yaooqinn:SPARK-34346-30

Conversation


@yaooqinn yaooqinn commented Feb 5, 2021

Backport #31460 to 3.0

What changes were proposed in this pull request?

In many real-world cases, when interacting with the Hive catalog through Spark SQL, users simply share the `hive-site.xml` used for their Hive jobs and copy it to `SPARK_HOME/conf` without modification. When Spark generates Hadoop configurations, it uses `spark.buffer.size` (default 65536) to override `io.file.buffer.size` (Hadoop default 4096). But when `hive-site.xml` is loaded afterwards, this setting is ignored and `io.file.buffer.size` is reset again according to `hive-site.xml`.

  1. The configuration priority for applying Hadoop and Hive settings here is wrong; the order should be `spark > spark.hive > spark.hadoop > hive > hadoop`.

  2. This breaks the `spark.buffer.size` config's ability to tune IO performance with HDFS whenever `hive-site.xml` contains an existing `io.file.buffer.size` entry.
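The intended precedence can be sketched as a layered merge, applied from lowest to highest priority (a hypothetical simulation for illustration only, not Spark's actual implementation; the layer names and values follow the description above):

```python
# Hypothetical simulation of the configuration layering described above.
# Layers are applied from lowest to highest priority, so later layers win:
# hadoop defaults < hive-site.xml < spark.hadoop.* < spark.hive.* < spark.*
def merge_layers(layers):
    """Merge dicts so that later (higher-priority) layers overwrite earlier ones."""
    conf = {}
    for layer in layers:
        conf.update(layer)
    return conf

hadoop_defaults = {"io.file.buffer.size": "4096"}     # Hadoop default
hive_site       = {"io.file.buffer.size": "131072"}   # copied hive-site.xml (example value)
spark_settings  = {"io.file.buffer.size": "65536"}    # derived from spark.buffer.size

# Correct order: Spark settings applied last, so spark.buffer.size wins.
correct = merge_layers([hadoop_defaults, hive_site, spark_settings])
assert correct["io.file.buffer.size"] == "65536"

# Buggy order before this fix: hive-site.xml loaded last, clobbering spark.buffer.size.
buggy = merge_layers([hadoop_defaults, spark_settings, hive_site])
assert buggy["io.file.buffer.size"] == "131072"
```

The fix amounts to ensuring the hive-site.xml layer is applied before, not after, the Spark-derived settings.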

Why are the changes needed?

Bugfix for the configuration loading behavior; it also fixes the performance regression caused by that behavior change.

Does this PR introduce any user-facing change?

Yes. This PR restores behavior that was silently changed for users: `spark.buffer.size` once again takes precedence over `io.file.buffer.size` from `hive-site.xml`.

How was this patch tested?

New tests.


Closes #31460 from yaooqinn/SPARK-34346.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

yaooqinn commented Feb 5, 2021

cc @cloud-fan @maropu @HyukjinKwon @dongjoon-hyun thanks


SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39521/


SparkQA commented Feb 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39521/


@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @yaooqinn .
Merged to branch-3.0.

dongjoon-hyun pushed a commit that referenced this pull request Feb 5, 2021

Closes #31492 from yaooqinn/SPARK-34346-30.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

SparkQA commented Feb 5, 2021

Test build #134938 has finished for PR 31492 at commit 1157fd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
